The following analysis investigates factors affecting the quality of white wine, focussing on 11 variables that quantify the chemical properties of wine and an overall quality score.
The details of each variable are available in the attached text file (citations.txt).
After loading in the data, I’ll take a look at the variables within the dataframe:
## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
I will also take a look at a summary of the data:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
From the summary I know that no wine has been scored below 3 or above 9 for quality, so I set the limits for the histogram accordingly. The mean (red line) and the first and third quartiles (red dashes) are included, showing that most wines have been scored a 5 or a 6. The median score is also 6.
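A histogram along these lines could be drawn as in the sketch below. It is only an illustration: the `quality` vector is rebuilt from the per-score counts reported later in this analysis rather than read from the original file, and the use of base R graphics is an assumption, since the plotting code itself is not shown.

```r
# Counts per quality score, transcribed from the per-score counts in this report.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Rebuild one quality value per wine from those counts.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# One bar per integer score, with the x-axis limited to the observed range 3 to 9.
hist(quality, breaks = seq(2.5, 9.5, by = 1),
     main = "White wine quality scores", xlab = "Quality score")

# Mean as a solid red line, first and third quartiles as dashed red lines.
abline(v = mean(quality), col = "red")
abline(v = quantile(quality, c(0.25, 0.75)), col = "red", lty = 2)
```

The quartile lines land on 5 and 6, which is what makes the concentration of scores visible at a glance.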
##
## 3.8 3.9 4.2 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5
## 1 1 2 3 1 1 5 9 7 24 23 28 27 28 31
## 5.6 5.7 5.8 5.9 6 6.1 6.15 6.2 6.3 6.4 6.45 6.5 6.6 6.7 6.8
## 71 88 121 103 184 155 2 192 188 280 1 225 290 236 308
## 6.9 7 7.1 7.15 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8 8.1 8.2
## 241 232 200 2 206 178 194 123 153 93 93 74 80 56 56
## 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9 9.1 9.2 9.3 9.4 9.5 9.6 9.7
## 52 35 32 25 15 18 16 17 6 21 3 11 2 5 4
## 9.8 9.9 10 10.2 10.3 10.7 11.8 14.2
## 8 2 3 1 2 2 1 1
Most wines have between 4g/L-10g/L of fixed acidity. The quartile lines show that the distribution of the data is symmetrical, similar to a normal distribution.
Most wines have volatile acidity between 0.1 and 0.7, although there are a number of outliers. The mean and median are both between 0.2 and 0.3, with the first and third quartiles at 0.21 and 0.32 respectively.
Most wines have a citric acid level of less than 0.75g/L. The data looks to be fairly symmetrical, with the mean and median almost equal.
The residual sugar data is positively skewed.
##
## 0.6 0.7 0.8 0.9 0.95 1 1.05 1.1 1.15 1.2 1.25 1.3
## 2 7 25 39 4 93 1 146 3 187 3 147
## 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 1.8 1.85 1.9
## 2 184 4 142 2 165 2 99 1 99 3 59
## 1.95 2 2.05 2.1 2.2 2.25 2.3 2.35 2.4 2.5 2.6 2.65
## 2 79 1 51 56 2 42 1 41 40 33 1
## 2.7 2.8 2.85 2.9 3 3.1 3.15 3.2 3.3 3.4 3.5 3.6
## 38 36 1 25 17 17 1 28 23 13 31 22
## 3.7 3.75 3.8 3.85 3.9 3.95 4 4.1 4.2 4.25 4.3 4.35
## 12 2 21 3 17 3 19 17 31 2 19 1
## 4.4 4.45 4.5 4.55 4.6 4.7 4.75 4.8 4.85 4.9 5 5.1
## 14 3 33 2 40 29 5 38 1 35 43 28
## 5.15 5.2 5.25 5.3 5.35 5.4 5.45 5.5 5.55 5.6 5.7 5.8
## 2 29 4 17 2 23 2 13 1 16 30 23
## 5.85 5.9 5.95 6 6.1 6.2 6.3 6.35 6.4 6.5 6.55 6.6
## 2 19 1 23 21 31 39 1 34 26 1 30
## 6.65 6.7 6.75 6.8 6.85 6.9 6.95 7 7.05 7.1 7.2 7.25
## 3 25 1 28 6 20 1 31 2 36 29 2
## 7.3 7.35 7.4 7.45 7.5 7.6 7.7 7.75 7.8 7.85 7.9 7.95
## 19 2 40 1 30 29 34 2 41 1 32 1
## 8 8.1 8.15 8.2 8.25 8.3 8.4 8.45 8.5 8.55 8.6 8.65
## 32 34 1 36 2 31 13 1 24 1 27 1
## 8.7 8.75 8.8 8.9 8.95 9 9.05 9.1 9.15 9.2 9.25 9.3
## 18 2 22 23 1 18 1 17 2 22 2 11
## 9.4 9.5 9.55 9.6 9.65 9.7 9.8 9.85 9.9 10 10.05 10.1
## 10 9 1 18 4 22 16 3 18 18 3 14
## 10.2 10.3 10.4 10.5 10.55 10.6 10.65 10.7 10.8 10.9 11 11.1
## 23 16 25 16 1 22 1 26 17 11 19 18
## 11.2 11.25 11.3 11.4 11.45 11.5 11.6 11.7 11.75 11.8 11.9 11.95
## 18 2 12 14 1 11 15 8 4 35 16 3
## 12 12.05 12.1 12.15 12.2 12.3 12.4 12.5 12.55 12.6 12.7 12.75
## 16 1 21 4 15 13 19 16 2 16 16 1
## 12.8 12.85 12.9 13 13.1 13.15 13.2 13.3 13.4 13.5 13.55 13.6
## 25 4 25 19 23 1 13 16 7 10 3 12
## 13.65 13.7 13.8 13.9 14 14.05 14.1 14.15 14.2 14.3 14.35 14.4
## 4 21 8 18 16 1 4 1 20 17 3 17
## 14.45 14.5 14.55 14.6 14.7 14.75 14.8 14.9 14.95 15 15.1 15.15
## 3 17 3 13 14 2 12 14 2 13 7 1
## 15.2 15.25 15.3 15.4 15.5 15.55 15.6 15.7 15.75 15.8 15.9 16
## 6 1 9 17 11 6 14 9 1 6 2 10
## 16.05 16.1 16.2 16.3 16.4 16.45 16.5 16.55 16.6 16.65 16.7 16.75
## 6 2 7 7 5 1 3 1 2 5 5 2
## 16.8 16.85 16.9 16.95 17 17.05 17.1 17.2 17.3 17.35 17.4 17.45
## 4 4 3 3 1 1 5 9 14 1 2 2
## 17.5 17.55 17.6 17.7 17.75 17.8 17.85 17.9 17.95 18 18.05 18.1
## 8 3 2 1 4 13 5 2 3 2 3 6
## 18.15 18.2 18.3 18.35 18.4 18.5 18.6 18.75 18.8 18.9 18.95 19.1
## 8 3 2 4 1 1 1 4 3 1 3 1
## 19.25 19.3 19.35 19.4 19.45 19.5 19.6 19.8 19.9 19.95 20.15 20.2
## 3 4 1 2 3 2 1 4 1 3 1 2
## 20.3 20.4 20.7 20.8 22 22.6 23.5 26.05 31.6 65.8
## 1 1 2 2 2 1 1 2 2 1
Taking the log10 of the data, the distribution looks to be bimodal, with one peak at around 0.25 and another at around 1 on the log10 scale (roughly 1.8g/L and 10g/L).
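The peak positions can be converted back to g/L, under the assumption that the values 0.25 and 1 are read off a log10-scaled axis (residual sugar in this dataset has a minimum of 0.6g/L, so they are unlikely to be raw g/L values). A minimal sketch:

```r
# Approximate peak positions, assumed to be on the log10 scale.
log10_peaks <- c(0.25, 1)

# Undo the log10 transform to express the peaks in g/L.
peaks_g_per_l <- 10^log10_peaks
round(peaks_g_per_l, 2)
```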
There are also a number of outliers in the data for chlorides.
##
## 0.009 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.02 0.021 0.022
## 1 1 1 4 4 5 5 10 9 16 19 19
## 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.03 0.031 0.032 0.033 0.034
## 20 34 30 54 58 85 81 108 107 109 119 168
## 0.035 0.036 0.037 0.038 0.039 0.04 0.041 0.042 0.043 0.044 0.045 0.046
## 130 200 160 167 157 182 147 184 141 201 170 181
## 0.047 0.048 0.049 0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058
## 171 174 133 170 115 104 130 99 61 88 68 53
## 0.059 0.06 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069 0.07
## 36 46 19 25 23 15 8 18 18 7 18 6
## 0.071 0.072 0.073 0.074 0.075 0.076 0.077 0.078 0.079 0.08 0.081 0.082
## 5 2 5 8 2 9 1 2 4 4 2 2
## 0.083 0.084 0.085 0.086 0.087 0.088 0.089 0.09 0.091 0.092 0.093 0.094
## 5 5 3 4 3 2 1 2 1 3 3 5
## 0.095 0.096 0.097 0.098 0.099 0.102 0.104 0.105 0.108 0.11 0.112 0.114
## 2 6 1 3 1 1 1 1 2 3 1 1
## 0.115 0.117 0.118 0.119 0.12 0.121 0.122 0.123 0.126 0.127 0.13 0.132
## 1 3 1 3 1 2 1 4 3 2 1 1
## 0.133 0.135 0.136 0.137 0.138 0.142 0.144 0.145 0.146 0.147 0.148 0.149
## 1 1 1 2 2 3 1 1 1 2 1 1
## 0.15 0.152 0.154 0.156 0.157 0.158 0.16 0.167 0.168 0.169 0.17 0.171
## 1 2 1 1 4 1 2 2 3 2 2 1
## 0.172 0.173 0.174 0.175 0.176 0.179 0.18 0.184 0.185 0.186 0.194 0.197
## 2 2 2 2 2 1 1 2 2 1 1 2
## 0.2 0.201 0.204 0.208 0.209 0.211 0.212 0.217 0.239 0.24 0.244 0.255
## 1 2 1 2 1 1 1 1 1 1 1 1
## 0.271 0.29 0.301 0.346
## 1 1 1 1
Most wines have a chloride level of between 0.03g/L-0.06g/L.
##
## 2 3 4 5 6 7 8 9 10 11 11.5 12
## 1 10 11 25 32 25 35 29 55 45 1 51
## 13 14 15 15.5 16 17 18 19 19.5 20 21 22
## 55 68 79 1 58 89 80 84 1 101 93 102
## 23 23.5 24 25 26 27 28 28.5 29 30 30.5 31
## 110 1 118 111 129 99 112 1 160 99 1 132
## 32 33 34 35 35.5 36 37 38 38.5 39 39.5 40
## 109 112 128 129 2 127 111 102 1 89 1 103
## 40.5 41 41.5 42 42.5 43 43.5 44 44.5 45 46 47
## 1 104 2 86 1 63 1 75 4 101 64 91
## 48 48.5 49 50 50.5 51 51.5 52 52.5 53 54 55
## 66 7 82 64 2 54 1 72 4 68 61 58
## 56 57 58 59 59.5 60 60.5 61 61.5 62 63 64
## 42 44 37 39 2 38 2 47 1 29 30 23
## 64.5 65 66 67 68 69 70 70.5 71 72 73 73.5
## 1 14 17 22 24 17 11 1 5 6 8 4
## 74 75 76 77 77.5 78 79 79.5 80 81 82 82.5
## 5 7 5 5 1 4 2 4 1 7 2 1
## 83 85 86 87 88 89 93 95 96 97 98 101
## 4 2 2 4 1 1 1 1 3 1 3 2
## 105 108 110 112 118.5 122.5 124 128 131 138.5 146.5 289
## 2 3 1 1 1 1 1 1 1 1 1 1
Changing the bin width and excluding the outlier gives the following graph:
The typical range for free sulfur dioxide in wine is between 10mg/L-70mg/L.
In general, total sulfur dioxide ranges between 75mg/L-225mg/L, with a few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Most of the wine falls within a fairly narrow density range, between 0.989 and 0.996.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The pH level for the wine is typically between 3 and 3.5. Looking at the quartiles and the median, the data looks to be fairly symmetrical with the mean pH level 3.188 and the median 3.18.
The levels of sulphates are slightly positively skewed; for most wines the range is between 0.3g/L and 0.6g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
##
## 8 8.4 8.5 8.6
## 2 3 9 23
## 8.7 8.8 8.9 9
## 78 107 95 185
## 9.1 9.2 9.3 9.4
## 144 199 134 229
## 9.5 9.53333333333333 9.55 9.6
## 228 3 2 128
## 9.63333333333333 9.7 9.73333333333333 9.75
## 1 105 2 1
## 9.8 9.9 10 10.0333333333333
## 136 109 162 1
## 10.1 10.1333333333333 10.15 10.2
## 114 2 3 130
## 10.3 10.4 10.4666666666667 10.5
## 85 153 2 160
## 10.5333333333333 10.55 10.5666666666667 10.6
## 1 2 1 114
## 10.65 10.7 10.8 10.9
## 1 96 135 88
## 10.9333333333333 10.9666666666667 10.98 11
## 2 3 1 158
## 11.05 11.0666666666667 11.1 11.2
## 2 1 83 112
## 11.2666666666667 11.3 11.3333333333333 11.35
## 1 101 3 1
## 11.3666666666667 11.4 11.4333333333333 11.45
## 1 121 1 4
## 11.4666666666667 11.5 11.55 11.6
## 1 88 1 46
## 11.6333333333333 11.65 11.7 11.7333333333333
## 2 1 58 1
## 11.75 11.8 11.85 11.9
## 2 60 1 53
## 11.94 11.95 12 12.05
## 2 1 102 1
## 12.0666666666667 12.1 12.15 12.2
## 1 51 2 86
## 12.25 12.3 12.3333333333333 12.4
## 1 62 1 68
## 12.5 12.6 12.7 12.75
## 83 63 56 3
## 12.8 12.8933333333333 12.9 13
## 54 2 39 36
## 13.05 13.1 13.1333333333333 13.2
## 1 18 1 14
## 13.3 13.4 13.5 13.55
## 7 20 12 1
## 13.6 13.7 13.8 13.9
## 9 7 2 3
## 14 14.05 14.2
## 5 1 1
While the median alcohol percentage is 10.4%, the mean is 10.5% and for most of the wine, the alcohol levels are between 9%-12%.
While volatile acidity should be kept as low as possible, overall acidity is important in wine: too high can lead to a sour tasting wine, and too low will result in a flat taste.
Therefore, I have decided to look at total acidity in the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.130 6.890 7.405 7.467 7.960 14.960
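As a sketch of how such a variable can be built, the snippet below sums fixed acidity, volatile acidity and citric acid, which is the definition of total acidity given later in this write-up. The data frame `wine_sample` is a hypothetical stand-in holding only the first five rows shown in the `str()` output above, not the full dataset.

```r
# First five rows of the three acidity columns, transcribed from the str() output.
wine_sample <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid      = c(0.36, 0.34, 0.40, 0.32, 0.32)
)

# Total acidity as the sum of the three acidity measures.
wine_sample$total.acidity <- with(wine_sample,
                                  fixed.acidity + volatile.acidity + citric.acid)
wine_sample
```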
The plot looks fairly symmetrical, similar to a normal distribution, although there are a few outliers.
Since acidity can counterbalance sweetness in a wine, I would like to look at this ratio.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1423 0.7768 1.3840 2.5330 4.2560 15.4800
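The ratio can be sketched as total acidity divided by residual sugar. The column name `aciditySugar.ratio` matches the one that appears in the correlation output of this report. The data frame `wine_sample` is again a hypothetical five-row stand-in taken from the `str()` output, so the numbers only illustrate the calculation.

```r
# First five rows of the relevant columns, transcribed from the str() output.
wine_sample <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid      = c(0.36, 0.34, 0.40, 0.32, 0.32),
  residual.sugar   = c(20.7, 1.6, 6.9, 8.5, 8.5)
)

# Total acidity, then its ratio to residual sugar.
wine_sample$total.acidity <- with(wine_sample,
                                  fixed.acidity + volatile.acidity + citric.acid)
wine_sample$aciditySugar.ratio <- wine_sample$total.acidity / wine_sample$residual.sugar

round(wine_sample$aciditySugar.ratio, 3)
```

A very sweet wine (20.7g/L of residual sugar) gives a ratio well below 1, while a dry one (1.6g/L) gives a ratio above 4, which is why the ratio is so strongly skewed.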
There are 4898 different wines in the dataset, and 12 variables have been examined:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
- quality
All variables are floating point numbers with the exception of quality which is an integer. The wine in the dataset has been given a quality rating from 0 (worst) to 10 (best).
Other observations:
1. The spread of the quality of the white wine resembles a normal distribution with a peak at 6.
2. Acidity (fixed and volatile), sulfur dioxide levels (total and free) and pH levels also display characteristics of a normal distribution.
3. Alcohol content seems to be fairly evenly distributed between the 9%-12% level, with the exception of around 10%, which seems to have a higher frequency.
4. Residual sugar levels are relatively low for much of the sample, with the most common values below 2g/L.
5. There is a wine with a residual sugar level of 65.8g/L, which seems to be an outlier.
6. There are also outliers with high levels of chlorides, acidity (volatile and fixed) and free sulfur dioxide.
7. Both fixed acidity and volatile acidity have clear peaks.
I included a total acidity variable, adding together fixed acidity and citric acid with volatile acidity. While levels of volatile acidity should be kept as low as possible to avoid fermentation, the balance of overall acidity could determine whether a particular wine is flat tasting (acidity too low), or sour (acidity too high).
I have also looked at the ratio of total acidity to residual sugar. Perhaps there is an optimum level?
I log transformed the right skewed (positively skewed) residual sugar distribution. This resulted in a bimodal graph with peaks at around 0.25 and 1 on the log10 scale (roughly 1.8g/L and 10g/L).
The main feature of this dataset is quality. I will be examining factors that affect the quality of wine.
In order to determine which particular variables I will look at more closely, I would first like to examine correlation data.
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.25581431 0.002857966
## fixed.acidity -0.255814305 1.00000000 -0.022697290
## volatile.acidity 0.002857966 -0.02269729 1.000000000
## citric.acid -0.149899918 0.28918070 -0.149471811
## residual.sugar 0.006623775 0.08902070 0.064286060
## chlorides -0.045645192 0.02308564 0.070511571
## free.sulfur.dioxide -0.011928911 -0.04939586 -0.097011939
## total.sulfur.dioxide -0.161979037 0.09106976 0.089260504
## density -0.185976097 0.26533101 0.027113845
## pH -0.115774132 -0.42585829 -0.031915368
## sulphates 0.009807759 -0.01714299 -0.035728147
## alcohol 0.213656245 -0.12088112 0.067717943
## quality 0.035763247 -0.11366283 -0.194722969
## total.acidity -0.263216664 0.98717874 0.071570617
## aciditySugar.ratio -0.062868153 0.11208572 -0.105308868
## citric.acid residual.sugar chlorides
## X -0.149899918 0.006623775 -0.04564519
## fixed.acidity 0.289180698 0.089020701 0.02308564
## volatile.acidity -0.149471811 0.064286060 0.07051157
## citric.acid 1.000000000 0.094211624 0.11436445
## residual.sugar 0.094211624 1.000000000 0.08868454
## chlorides 0.114364448 0.088684536 1.00000000
## free.sulfur.dioxide 0.094077221 0.299098354 0.10139235
## total.sulfur.dioxide 0.121130798 0.401439311 0.19891030
## density 0.149502571 0.838966455 0.25721132
## pH -0.163748211 -0.194133454 -0.09043946
## sulphates 0.062330940 -0.026664366 0.01676288
## alcohol -0.075728730 -0.450631222 -0.36018871
## quality -0.009209091 -0.097576829 -0.20993441
## total.acidity 0.394143356 0.104737493 0.04552987
## aciditySugar.ratio 0.022546906 -0.764289501 -0.04688074
## free.sulfur.dioxide total.sulfur.dioxide density
## X -0.0119289106 -0.161979037 -0.18597610
## fixed.acidity -0.0493958591 0.091069756 0.26533101
## volatile.acidity -0.0970119393 0.089260504 0.02711385
## citric.acid 0.0940772210 0.121130798 0.14950257
## residual.sugar 0.2990983537 0.401439311 0.83896645
## chlorides 0.1013923521 0.198910300 0.25721132
## free.sulfur.dioxide 1.0000000000 0.615500965 0.29421041
## total.sulfur.dioxide 0.6155009650 1.000000000 0.52988132
## density 0.2942104109 0.529881324 1.00000000
## pH -0.0006177961 0.002320972 -0.09359149
## sulphates 0.0592172458 0.134562367 0.07449315
## alcohol -0.2501039415 -0.448892102 -0.78013762
## quality 0.0081580671 -0.174737218 -0.30712331
## total.acidity -0.0451333172 0.113188502 0.27560881
## aciditySugar.ratio -0.2888136640 -0.369038808 -0.57164873
## pH sulphates alcohol quality
## X -0.1157741316 0.0098077589 0.21365624 0.035763247
## fixed.acidity -0.4258582910 -0.0171429850 -0.12088112 -0.113662831
## volatile.acidity -0.0319153683 -0.0357281469 0.06771794 -0.194722969
## citric.acid -0.1637482114 0.0623309403 -0.07572873 -0.009209091
## residual.sugar -0.1941334540 -0.0266643659 -0.45063122 -0.097576829
## chlorides -0.0904394560 0.0167628837 -0.36018871 -0.209934411
## free.sulfur.dioxide -0.0006177961 0.0592172458 -0.25010394 0.008158067
## total.sulfur.dioxide 0.0023209718 0.1345623669 -0.44889210 -0.174737218
## density -0.0935914935 0.0744931485 -0.78013762 -0.307123313
## pH 1.0000000000 0.1559514973 0.12143210 0.099427246
## sulphates 0.1559514973 1.0000000000 -0.01743277 0.053677877
## alcohol 0.1214320987 -0.0174327719 1.00000000 0.435574715
## quality 0.0994272457 0.0536778771 0.43557472 1.000000000
## total.acidity -0.4306513315 -0.0118522486 -0.11751272 -0.131377207
## aciditySugar.ratio 0.0548562103 -0.0008598056 0.25832261 -0.013264024
## total.acidity aciditySugar.ratio
## X -0.26321666 -0.0628681526
## fixed.acidity 0.98717874 0.1120857243
## volatile.acidity 0.07157062 -0.1053088681
## citric.acid 0.39414336 0.0225469064
## residual.sugar 0.10473749 -0.7642895014
## chlorides 0.04552987 -0.0468807358
## free.sulfur.dioxide -0.04513332 -0.2888136640
## total.sulfur.dioxide 0.11318850 -0.3690388081
## density 0.27560881 -0.5716487284
## pH -0.43065133 0.0548562103
## sulphates -0.01185225 -0.0008598056
## alcohol -0.11751272 0.2583226123
## quality -0.13137721 -0.0132640236
## total.acidity 1.00000000 0.0976389289
## aciditySugar.ratio 0.09763893 1.0000000000
Since quality seems to correlate with alcohol, density, chlorides and volatile acidity, I will initially be focussing on these variables.
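The choice of variables can be checked by ranking the correlations with quality by absolute size. The sketch below uses the values from the `quality` column of the correlation matrix above, rounded to four decimal places. The row index `X` and `quality` itself are left out, which is my own simplification.

```r
# Correlation of each variable with quality, transcribed from the matrix above.
cor_with_quality <- c(
  fixed.acidity = -0.1137, volatile.acidity = -0.1947, citric.acid = -0.0092,
  residual.sugar = -0.0976, chlorides = -0.2099, free.sulfur.dioxide = 0.0082,
  total.sulfur.dioxide = -0.1747, density = -0.3071, pH = 0.0994,
  sulphates = 0.0537, alcohol = 0.4356, total.acidity = -0.1314,
  aciditySugar.ratio = -0.0133
)

# Rank by strength of the relationship, ignoring its direction.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
round(ranked, 3)

# The four strongest correlations with quality.
head(names(ranked), 4)
```

Alcohol, density, chlorides and volatile acidity come out on top, with total sulfur dioxide close behind.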
Adjusting the plot slightly:
It may be more useful to look at boxplots and a summary for each quality rating.
## whitewine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.34 11.00 12.60
## --------------------------------------------------------
## whitewine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## whitewine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## whitewine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## whitewine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## whitewine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## whitewine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
There does seem to be a pattern, particularly among the better quality wines (quality score of 5+).
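One way to make this pattern explicit is to look at how the median alcohol level changes from one quality score to the next. The sketch below uses the medians from the summaries above. The variable names are illustrative only.

```r
# Median alcohol (%) for quality scores 3 to 9, transcribed from the summaries above.
quality_score  <- 3:9
median_alcohol <- c(10.45, 10.10, 9.50, 10.50, 11.40, 12.00, 12.50)

# Change in median alcohol between consecutive quality scores.
step_change <- diff(median_alcohol)
data.frame(from = quality_score[-length(quality_score)],
           to = quality_score[-1],
           change = step_change)

# TRUE if the median rises at every step from a score of 5 upwards.
increasing_from_5 <- all(step_change[quality_score[-1] > 5] > 0)
increasing_from_5
```

The median falls from 3 to 4 and from 4 to 5, then rises at every step after that.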
Overlaying with a scatterplot, and excluding the outliers:
Density seems to decrease as quality increases.
Two further relationships I would like to look at in more detail are density and residual sugar levels and density and alcohol levels.
This shows that in general, the higher the residual sugar level is, the more dense the wine is.
This shows that the higher the alcohol percentage is, the less dense the wine is.
## whitewine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0000
## --------------------------------------------------------
## whitewine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0000
## --------------------------------------------------------
## whitewine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0020
## --------------------------------------------------------
## whitewine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## whitewine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
## --------------------------------------------------------
## whitewine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
## --------------------------------------------------------
## whitewine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9896 0.9898 0.9903 0.9915 0.9906 0.9970
This shows an inverse relationship between density and wine quality.
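As a rough check on the direction of this relationship, a rank correlation can be computed between the quality score and the mean density of each group, using the means from the summaries above. This is a sketch on seven group means only, so it is not comparable with the row-level correlation of about -0.31 reported in the correlation matrix.

```r
# Mean density for quality scores 3 to 9, transcribed from the summaries above.
quality_score <- 3:9
mean_density  <- c(0.9949, 0.9943, 0.9953, 0.9940, 0.9925, 0.9922, 0.9915)

# Spearman rank correlation: negative values mean density falls as quality rises.
rho <- cor(quality_score, mean_density, method = "spearman")
round(rho, 3)
```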
## whitewine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1700 0.2375 0.2600 0.3332 0.4125 0.6400
## --------------------------------------------------------
## whitewine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1100 0.2700 0.3200 0.3812 0.4600 1.1000
## --------------------------------------------------------
## whitewine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.240 0.280 0.302 0.340 0.905
## --------------------------------------------------------
## whitewine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2000 0.2500 0.2606 0.3000 0.9650
## --------------------------------------------------------
## whitewine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2628 0.3200 0.7600
## --------------------------------------------------------
## whitewine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.2000 0.2600 0.2774 0.3300 0.6600
## --------------------------------------------------------
## whitewine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.240 0.260 0.270 0.298 0.360 0.360
The line graph of mean volatile acidity for each quality score and the boxplots show a non-linear pattern. With the wine that has a quality score of 6 or more the volatile acidity actually increases.
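The turning point can be located directly from the group means in the summaries above. A minimal sketch, with illustrative variable names:

```r
# Mean volatile acidity for quality scores 3 to 9, transcribed from the summaries above.
quality_score <- 3:9
mean_volatile <- c(0.3332, 0.3812, 0.3020, 0.2606, 0.2628, 0.2774, 0.2980)

# Quality score at which mean volatile acidity is lowest.
lowest_at <- quality_score[which.min(mean_volatile)]
lowest_at

# Mean volatile acidity from the lowest point onwards.
mean_volatile[quality_score >= lowest_at]
```

The mean is lowest at a score of 6 and rises at each step from there up to 9.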
The above graph looks like it could suggest the lower the chloride level, the higher the quality. Looking at box plots for each quality rating, we should be able to get more detail:
## whitewine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400
## --------------------------------------------------------
## whitewine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0130 0.0380 0.0460 0.0501 0.0540 0.2900
## --------------------------------------------------------
## whitewine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
## --------------------------------------------------------
## whitewine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## whitewine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
## --------------------------------------------------------
## whitewine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100
## --------------------------------------------------------
## whitewine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0180 0.0210 0.0310 0.0274 0.0320 0.0350
The boxplot above shows that the better wines (5+) have decreasing amounts of chlorides.
In general, it looks as though the lower the acidity is, the better the wine is. However, wine with a quality rating of 9 seems to go against the general pattern, with increased levels of total, citric and fixed acidity.
This shows no clear pattern.
For most quality scores, the free sulfur dioxide levels remain steady. The notable exceptions are 3, 4 and 9.
The boxplot above shows that, in general, decreasing levels of total sulfur dioxide relate to higher quality wines.
Higher pH seems to relate to a higher quality wine.
Overall, due to the fact that a lot of wines have been given a score of 6, the scatterplots suffered from overplotting.
Quality correlates with alcohol, with the mean and median alcohol percentage increasing with the quality scores.
Density appears to be negatively correlated with quality, with both the mean and the median density decreasing as the quality score increases.
In terms of volatile acidity, there appears to be a non-linear relationship with quality: for quality scores between 4 and 6, volatile acidity decreases as quality increases, and between 6 and 9 it increases as quality increases.
Similarly, in terms of chlorides, wine with a quality score of 5 or above has decreasing levels of chlorides as quality increases.
From the bivariate plots, it looks like the mean values for better wines, i.e. those with a score of 5 or above, display different characteristics to those with lower scores.
Additionally I looked at density in further detail, looking at its relationship with alcohol and with residual sugar. Density increases with residual sugar levels and decreases with alcohol.
The strongest relationships I found in the data were with alcohol and with density in relation to the quality of wine. Chlorides and pH also seem to relate to the quality score.
At this point I’d like to cut the data into ‘bad’, ‘good’ and ‘best’ buckets according to their quality ratings:
- bad: 3-4
- good: 5-6
- best: 7-9
This should make the multivariate plots easier to read.
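The bucketing can be sketched with `cut()`. The variable name `quality.bucket` is the one mentioned later in this write-up. The break points are my assumption about how the three ranges were encoded, and `quality` is rebuilt from the per-score counts reported in this analysis rather than read from the original file.

```r
# Rebuild the quality column from the per-score counts reported in this analysis.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Intervals are (2,4], (4,6] and (6,9], i.e. scores 3-4, 5-6 and 7-9.
quality.bucket <- cut(quality, breaks = c(2, 4, 6, 9),
                      labels = c("bad", "good", "best"))

table(quality.bucket)
```

With these breaks the buckets are very unequal in size: 183 bad, 3655 good and 1060 best wines.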
The above plot highlights the positive correlation between density and residual sugar. The higher quality wines appear to be less dense for a given residual sugar level compared with the lower quality wines.
Looking at the ratio of density to residual sugar, there is a loose positive correlation between this ratio and better quality wines. I will now look at this ratio against alcohol.
There does seem to be a vague pattern in the data: the good wines seem to have a higher alcohol level and a lower density/residual sugar ratio.
I should be able to find a stronger pattern in the data.
I will now take a closer look at the relationship between density and alcohol in relation to quality.
As already discussed, the higher quality wines tend to be less dense but I want to take a look at alcohol in more detail in relation to quality:
Better wines tend to have a higher alcohol level.
Looking at the density/alcohol ratio more closely, we get the following boxplots:
For the wines with a score of 5 or more, this shows there is a strong relationship between the density/alcohol ratio and the quality score, with few outliers.
I would also like to look at the relationship between chlorides and the density/alcohol ratio in relation to quality:
The above graph indicates the better wines have lower levels of chlorides and a lower density/alcohol ratio.
Intuitively, I would think that pH and acidity would be related and would have an effect on the quality of wine.
Looking at the scatterplot, there does not seem to be a significant relationship between pH, volatile acidity and quality. The density plot shows most of the wine fits into a narrow range of volatile acidity levels. However, the pH density plot shows a vague pattern of increasing pH leading to better wine quality.
Since I know from the previous section that pH and quality do seem to be correlated, I will now look at its relationship with alcohol levels and with density:
The boxplots show that there is a negative relationship with this ratio (pH/density/alcohol) and wine quality.
I will now look at the density/alcohol ratio against other variables:
Density/alcohol vs fixed acidity:
Density/alcohol vs citric acid:
Density/alcohol vs free sulfur dioxide:
Density/alcohol vs total sulfur dioxide:
Density/alcohol vs sulphates:
The above scatterplots show the good wines tend to have lower levels of sulphates, total and free sulfur dioxide, however among the ok wines, there seems to be no clear pattern. The boxplots also show no clear pattern across the different qualities of wine.
##
## Calls:
## m1: lm(formula = quality ~ I(pH/density/alcohol), data = whitewine)
## m2: lm(formula = quality ~ I(pH/density/alcohol) + chlorides, data = whitewine)
## m3: lm(formula = quality ~ I(pH/density/alcohol) + chlorides + I(log10(residual.sugar)),
## data = whitewine)
##
## =============================================================
## m1 m2 m3
## -------------------------------------------------------------
## (Intercept) 8.805*** 8.746*** 8.807***
## (0.104) (0.103) (0.104)
## I(pH/density/alcohol) -9.477*** -8.671*** -9.147***
## (0.334) (0.349) (0.367)
## chlorides -4.151*** -4.087***
## (0.562) (0.561)
## I(log10(residual.sugar)) 0.128***
## (0.031)
## -------------------------------------------------------------
## R-squared 0.1 0.2 0.2
## adj. R-squared 0.1 0.2 0.2
## sigma 0.8 0.8 0.8
## F 806.9 435.2 296.9
## p 0.0 0.0 0.0
## Log-likelihood -5981.0 -5953.8 -5945.1
## Deviance 3297.5 3261.2 3249.6
## AIC 11968.0 11915.7 11900.2
## BIC 11987.5 11941.7 11932.7
## N 4898 4898 4898
## =============================================================
The variables in the linear model only account for around 20% of the variation in quality scores (an R-squared of 0.2, as rounded in the table above), so the fit is weak.
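Because the table rounds R-squared to one decimal place, a more precise value can be recovered from the F statistic using the standard identity R² = kF / (kF + n - k - 1), where k is the number of predictors and n the number of observations. The sketch below applies it to the F values printed in the table. The variable names are illustrative.

```r
# Number of observations and predictors per model, from the model table above.
n <- 4898
k <- c(m1 = 1, m2 = 2, m3 = 3)

# F statistics as printed in the table.
f_stat <- c(m1 = 806.9, m2 = 435.2, m3 = 296.9)

# R-squared recovered from F: R^2 = k*F / (k*F + n - k - 1).
r_squared <- (k * f_stat) / (k * f_stat + (n - k - 1))
round(r_squared, 3)
```

On this calculation all three models explain roughly 14% to 15% of the variance, so adding chlorides and log10 residual sugar improves the fit only marginally.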
The relationship between alcohol and density was a relatively strong one in terms of determining the quality; this was further strengthened by including pH. A low pH/density/alcohol ratio looked like it resulted in a better quality wine. However, this relationship was not strong enough to build a useful linear model.
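To see what the fitted relationship implies, the coefficients of model m1 from the table above can be applied by hand. The sketch below does this for two hypothetical wines: one at the overall medians for pH, density and alcohol, and one identical except for a higher alcohol level of 12.5%. These inputs are chosen purely for illustration and are not rows of the dataset.

```r
# Coefficients of m1 (quality ~ I(pH/density/alcohol)), from the model table above.
intercept <- 8.805
slope     <- -9.477

# Two illustrative wines: overall medians, then the same wine with more alcohol.
pH      <- c(3.18, 3.18)
density <- c(0.9937, 0.9937)
alcohol <- c(10.4, 12.5)

# The model term, evaluated left to right as in the formula.
ratio <- pH / density / alcohol

# Predicted quality score for each wine.
predicted_quality <- intercept + slope * ratio
round(data.frame(ratio, predicted_quality), 3)
```

The lower ratio gives the higher predicted score, but the predictions move by only about half a point, consistent with the weak fit.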
After researching measures of wine quality on the internet, I expected to find pH (the ‘backbone of wine’), acidity and residual sugar to be significant factors in determining the quality of wine. Intuitively this seems to make sense since these aspects are more easily detectable. However my analysis found that density and alcohol level played a big role in determining the quality of wine.
Further, I expected free and/or total sulfur dioxide levels to also play a significant role in determining the quality of wine, since larger levels should be easily detectable, however this did not appear to be the case.
I tried to create a linear model of quality vs pH/density/alcohol, however this did not yield a strong fit: the model accounted for at most about 20% of the variance in the quality score.
I did create a new variable ‘quality.bucket’ in order to group the data by ranges of quality, making it easier to analyse the plots.
## whitewine$quality: 3
## [1] 20
## --------------------------------------------------------
## whitewine$quality: 4
## [1] 163
## --------------------------------------------------------
## whitewine$quality: 5
## [1] 1457
## --------------------------------------------------------
## whitewine$quality: 6
## [1] 2198
## --------------------------------------------------------
## whitewine$quality: 7
## [1] 880
## --------------------------------------------------------
## whitewine$quality: 8
## [1] 175
## --------------------------------------------------------
## whitewine$quality: 9
## [1] 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Of the 4898 bottles of wine in the dataset, 2198 have been given a quality score of 6 and 1457 a quality score of 5, so the majority (75%) of the wine is mediocre. The fact that the median is equal to the third quartile also shows how heavily weighted the data is. Overall the data is slightly positively skewed, and there are no observations that were given a score of 0, 1, 2 or 10.
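The share of wines scored 5 or 6 follows directly from the counts above. A minimal sketch, with illustrative variable names:

```r
# Counts per quality score, transcribed from the output above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Proportion of all wines that were given a score of 5 or 6.
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / sum(quality_counts)
round(100 * share_5_or_6, 1)
```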
Since the quality of wine seems to involve a delicate balance of a number of different variables, not limited to those found in this dataset, my guess is that this distribution of scores is typical: it is rare to find wines that are exceptional, or exceptionally bad.
## whitewine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0000
## --------------------------------------------------------
## whitewine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0000
## --------------------------------------------------------
## whitewine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0020
## --------------------------------------------------------
## whitewine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## whitewine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
## --------------------------------------------------------
## whitewine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
## --------------------------------------------------------
## whitewine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9896 0.9898 0.9903 0.9915 0.9906 0.9970
The mean density decreases as the quality of the wine increases. Overall, a large proportion of the wine with a score of 7, 8 or 9 is less dense than the wine with a lower score.
Although the general pattern is increasing quality for decreasing density, the exception to this rule is wine with a quality score of 5, where the mean density is actually higher than the mean density for wine with a quality score of 3 or 4.
The general pattern of decreasing density for increasing quality makes sense for white wine, since you would expect good wine to be light in density (although this may be my personal preference also!).
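The pattern and its exception can be read off the mean densities reported above. The following is a small check using those printed means (rounded to four decimals as displayed); the variable names are my own.

```r
# Mean density per quality score, copied from the summaries above.
mean_density <- c("3" = 0.9949, "4" = 0.9943, "5" = 0.9953, "6" = 0.9940,
                  "7" = 0.9925, "8" = 0.9922, "9" = 0.9915)

# Change in mean density at each step up in quality,
# labelled by the higher of the two scores.
step <- setNames(diff(unname(mean_density)), names(mean_density)[-1])

# Scores at which the mean density rises instead of falling.
rising <- names(step)[step > 0]
rising

# Rank correlation between score and mean density (negative = falling trend).
rho <- cor(3:9, unname(mean_density), method = "spearman")
rho
```

Only the step from 4 to 5 shows an increase, and the rank correlation across the seven scores is strongly negative.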
The boxplots summarise the relationship between the pH/Density/Alcohol ratio and the quality score given to wine; the scatterplot underneath shows the detail behind the summary plot. The plots show a tendency for a lower ratio to accompany a higher quality score. The ratio does seem to show a loose pattern; however, I believe looking at just three factors in the quality of wine may be oversimplifying the concept.
The summary data is as follows:
## whitewine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2604 0.2925 0.3066 0.3137 0.3194 0.4288
## --------------------------------------------------------
## whitewine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2427 0.2903 0.3182 0.3183 0.3419 0.3897
## --------------------------------------------------------
## whitewine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2294 0.3109 0.3302 0.3267 0.3456 0.4203
## --------------------------------------------------------
## whitewine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2159 0.2800 0.3091 0.3067 0.3340 0.4097
## --------------------------------------------------------
## whitewine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2174 0.2614 0.2837 0.2882 0.3121 0.3828
## --------------------------------------------------------
## whitewine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2219 0.2570 0.2740 0.2821 0.3028 0.3812
## --------------------------------------------------------
## whitewine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2609 0.2638 0.2649 0.2752 0.2779 0.3086
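As a sanity check on these summaries, the sketch below assumes the ratio is pH divided by density divided by alcohol, applies it to the first four rows shown by `str()` at the start of this analysis (values as displayed, so rounded), and then looks at the medians printed above. The variable names are hypothetical.

```r
# First four rows as displayed by str() earlier in this analysis.
head_rows <- data.frame(
  pH      = c(3.00, 3.30, 3.26, 3.19),
  density = c(1.001, 0.994, 0.995, 0.996),
  alcohol = c(8.8, 9.5, 10.1, 9.9)
)

# Assumed definition of the ratio: pH / density / alcohol.
head_rows$ratio <- head_rows$pH / head_rows$density / head_rows$alcohol
round(head_rows$ratio, 4)

# Median ratio per quality score, copied from the summaries above.
median_ratio <- c("3" = 0.3066, "4" = 0.3182, "5" = 0.3302, "6" = 0.3091,
                  "7" = 0.2837, "8" = 0.2740, "9" = 0.2649)

# From a score of 5 upwards, does the median fall at every step?
falls_from_5 <- all(diff(median_ratio[c("5", "6", "7", "8", "9")]) < 0)
falls_from_5
```

The computed ratios fall inside the range covered by the summaries, which is consistent with the assumed definition. The medians fall steadily from a score of 5 upwards, while scores 3 and 4 do not follow the trend.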
The dataset contains 4898 different variants of white wine, examined using 11 input variables and a quality score. After researching each variable and its expected effect on, and significance to, the quality of wine, I began my analysis by looking at each variable in turn to get an overview of the spread of the data. I then used correlation figures as a starting point to look more closely at cross-variable patterns, and eventually focussed on the pH/Density/Alcohol ratio, which seemed to show a relatively strong link to the quality score.
Whilst I did try to create a linear model, the data did not show a strong enough correlation between variables to create an adequate model. In many instances wine with a quality rating of 3 or 4 showed a similar composition to the wines that scored 8 or 9, and wine with a quality rating of 5 or 6 had too high a variance across the input variables to display a discernible pattern.
Overall, as mentioned earlier, the plots suffer from overplotting, particularly for wine given a score of 5 or 6.
Researching the input variables showed that quality wine has a delicate balance of a number of variables, more than those covered by this dataset. The next steps would be to look at more variables, and look at interactions between variables in a higher level of detail.